Abstract

This paper presents a memory-efficient, high-throughput, parallel, lifting-based running three-dimensional
discrete wavelet transform (3-D DWT) architecture. The 3-D DWT is constructed by combining two spatial
processors and four temporal processors. Each spatial processor (SP) applies the two-dimensional DWT to a
frame using the lifting-based 9/7 filter bank, first in the row direction through the row processor (RP)
and then in the column direction through the column processor (CP). To reduce the temporal memory and the
latency, the temporal processor (TP) is designed with a lifting-based 1-D Haar wavelet filter. The proposed
architecture replaces the multiplications with pipelined shift-add operations to reduce the critical path
delay (CPD). The two spatial processors work simultaneously on two adjacent frames and provide 2-D DWT
coefficients as inputs to the temporal processors. The TPs apply the one-dimensional DWT in the temporal
direction and provide eight 3-D DWT coefficients per clock cycle (throughput). Higher throughput reduces
the computing cycles per frame and enables lower power consumption. Implementation results show that the
proposed architecture offers reduced memory, low power consumption, low latency, and high throughput
compared with existing designs. The RTL of the proposed architecture is described in Verilog and
synthesized using a 90-nm CMOS standard cell library; the results show that it consumes 43.42 mW and
occupies an area of 231.45 K equivalent gates at a frequency of 200 MHz. The proposed architecture has
also been synthesized for the Xilinx Zynq 7020 series field programmable gate array (FPGA).

Index Terms

Discrete wavelet transform, 3-D DWT, lifting-based DWT, VLSI architecture, flipping structure,
strip-based scanning.


I. Introduction 

Video compression is a major requirement in many recent applications such as medical imaging, studio
applications, and broadcasting. The compression ratio of an encoder depends on the underlying compression
algorithm. The goal of compression techniques is to reduce the immense amount of visual information to a
manageable size so that it can be efficiently stored, transmitted, and displayed. A 3-D DWT based
compression system enables compression in the spatial as well as the temporal direction, which makes it
well suited to video compression. Moreover, wavelet-based compression provides scalability with the levels
of decomposition. Due to the continuous increase in video frame size (HD to UHD), video processing through
software coding tools has become more complex, and dedicated hardware is needed to achieve high performance
for high-resolution video. There is therefore a strong requirement for a VLSI architecture for an efficient
3-D DWT processor that consumes less power, is area and memory efficient, and operates at a frequency high
enough for real-time applications.

Over the last two decades, several hardware designs have been reported for the implementation of 2-D DWT
and 3-D DWT in different applications. The majority of the designs fall into three categories:
(i) convolution-based, (ii) lifting-based, and (iii) B-spline-based. Most of the existing architectures
suffer from large memory requirements, low throughput, and complex control circuitry. In general, circuit
complexity is determined by two major components: the arithmetic component and the memory component. The
arithmetic component includes adders and multipliers, whereas the memory component consists of temporal
memory and transpose memory. The complexity of the arithmetic component depends on the DWT filter length,
whereas the size of the memory component depends on the dimensions of the image. As image resolutions
continue to increase (HD to UHD), image dimensions are very large compared with the DWT filter length; as
a result, the memory component accounts for the major share of the overall complexity of a DWT architecture.

Convolution-based implementations [1]-[3] produce outputs in less time but require a large amount of
arithmetic resources, are memory intensive, and occupy a larger area. Lifting-based implementations
require less memory, have lower arithmetic complexity, and can be implemented in parallel. However, they
have a long critical path, and a large number of recent contributions aim to reduce the critical path in
lifting-based implementations. A general lifting-based structure [4] has a critical path of 4Tm + 8Ta,
where Tm and Ta denote the multiplier and adder delays; introducing a 4-stage pipeline cuts it down to
Tm + 2Ta. In [5], Huang et al. introduced a flipping structure that further reduces the critical path to
Tm + Ta. Although this reduces the critical path delay of lifting-based implementations, the memory
efficiency still needs to be improved. The majority of designs implement the 2-D DWT by first applying the
1-D DWT row-wise and then applying the 1-D DWT column-wise, which requires a large amount of memory to
store the intermediate coefficients. To reduce this memory requirement, several DWT architectures have been
proposed using line-based scanning methods [11]-[17]. Huang et al. [7]-[8] gave brief details of a
B-spline-based 2-D IDWT implementation, discussed the memory requirements of different scan techniques,
and proposed an efficient overlapped strip-based scanning to reduce the internal memory size. Several
parallel architectures have been proposed for lifting-based 2-D DWT [8]-[17]. The modified strip-based
scanning and parallel architecture for 2-D DWT proposed by Y. Hu et al. [17] is the most memory-efficient
design among the existing 2-D DWT architectures; it requires only 3N + 24P words of on-chip memory for an
N x N image with P parallel processing units (PUs). Several lifting-based 3-D DWT architectures have been
reported in the literature [18]-[24] to reduce the critical path of the 1-D DWT architecture and to
decrease the memory requirement of the 3-D architecture. Among the existing 3-D DWT designs, Darji et al.
[24] produced the best results by reducing the memory requirement while giving a throughput of
4 results/cycle. Still, it requires a large on-chip memory (4N^2 + 10N).

In this paper, we propose a new parallel and memory-efficient lifting-based 3-D DWT architecture that
requires only 2*(3N + 60P) + 48 words of on-chip memory and produces 8 results/cycle. The proposed 3-D
DWT architecture is built with two spatial 2-D DWT (CDF 9/7) processors and four temporal 1-D DWT (Haar)
processors. The proposed architecture replaces the multiplication operations with shift-and-add
operations, which reduces the CPD from Tm + Ta to 4Ta. A further reduction of the CPD to Ta is achieved by
pipelining the processing elements. To eliminate the temporal memory and to reduce the latency, the Haar
wavelet is used in the temporal processor. The resulting architecture reduces the latency and the on-chip
memory and increases the speed of operation compared with existing 3-D DWT designs. The following sections
provide the architectural details of the proposed 3-D DWT through the spatial and temporal processors.

The paper is organized as follows. The theoretical background for DWT is given in Section II. A detailed
description of the proposed architecture for 3-D DWT is provided in Section III. Implementation results
and performance comparison are given in Section IV. Finally, concluding remarks are given in Section V.

II. Theoretical background

The lifting-based wavelet transform is designed using a series of matrix decompositions specified by
Daubechies and Sweldens in [4]. By applying flipping [5] to the lifting scheme, the multipliers in the
longest delay path are eliminated, resulting in a shorter critical path. The original data on which the
DWT is applied is denoted by X[n], and the 1-D DWT outputs are the detail coefficients H[n] and the
approximation coefficients L[n]. For an image (2-D), this process is performed along the rows and along
the columns. Eqns. (1)-(6) are the design equations for the flipping-based lifting (9/7) 1-D DWT [5], and
the same equations are used to implement the proposed row processor (1-D DWT) and column processor
(1-D DWT).
where $a' = 1/\alpha$, $b' = 1/(\alpha\beta)$, $c' = 1/(\beta\gamma)$, $d' = 1/(\gamma\delta)$,
$K_0 = \alpha\beta\gamma/\zeta$, and $K_1 = \alpha\beta\gamma\delta\zeta$ [4]. The lifting step
coefficients $\alpha$, $\beta$, $\gamma$, $\delta$ and the scaling coefficient $\zeta$ are constants with
values $\alpha = -1.586134342$, $\beta = -0.052980118$, $\gamma = 0.8829110762$, $\delta = 0.4435068522$,
and $\zeta = 1.149604398$.
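As a sanity check on the coefficient definitions above, the following minimal Python sketch compares a conventional 9/7 lifting pass with a flipped one that uses a', b', c', d', K0, and K1. The flipped recurrences written here are the standard form implied by those definitions, not a transcription of eqns. (1)-(6). The periodic boundary handling, the function names, and the intermediate names (h1, l1, h2, l2) are assumptions made only for illustration.

```python
# Lifting constants for the CDF 9/7 filter bank, as quoted in the text.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398

# Flipped coefficients and merged scaling factors.
A = 1.0 / ALPHA
B = 1.0 / (ALPHA * BETA)
C = 1.0 / (BETA * GAMMA)
D = 1.0 / (GAMMA * DELTA)
K0 = ALPHA * BETA * GAMMA / ZETA
K1 = ALPHA * BETA * GAMMA * DELTA * ZETA


def _split(x):
    """Split an even-length sequence into even and odd samples."""
    m = len(x) // 2
    return [x[2 * n] for n in range(m)], [x[2 * n + 1] for n in range(m)], m


def conventional_97(x):
    """Conventional lifting with periodic extension (an assumption of this sketch)."""
    even, odd, m = _split(x)
    d1 = [odd[n] + ALPHA * (even[n] + even[(n + 1) % m]) for n in range(m)]
    s1 = [even[n] + BETA * (d1[n - 1] + d1[n]) for n in range(m)]
    d2 = [d1[n] + GAMMA * (s1[n] + s1[(n + 1) % m]) for n in range(m)]
    s2 = [s1[n] + DELTA * (d2[n - 1] + d2[n]) for n in range(m)]
    return [ZETA * v for v in s2], [v / ZETA for v in d2]


def flipped_97(x):
    """Flipped lifting: each multiplier acts on a single operand, and the
    remaining scale factors are absorbed into K0 and K1."""
    even, odd, m = _split(x)
    h1 = [A * odd[n] + even[n] + even[(n + 1) % m] for n in range(m)]
    l1 = [B * even[n] + h1[n - 1] + h1[n] for n in range(m)]
    h2 = [C * h1[n] + l1[n] + l1[(n + 1) % m] for n in range(m)]
    l2 = [D * l1[n] + h2[n - 1] + h2[n] for n in range(m)]
    return [K1 * v for v in l2], [K0 * v for v in h2]
```

Under these assumptions both functions return the same approximation and detail coefficients up to floating-point rounding, which is the property that allows the flipped form to shorten the critical path without changing the transform.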


Lifting-based wavelets are memory efficient and easy to implement in hardware. The lifting scheme
consists of three steps to decompose the samples: splitting, predicting (eqns. (1) and (3)), and updating
(eqns. (2) and (4)).

The Haar wavelet transform is orthogonal, simple to construct, and fast to compute. Considering these
advantages, the proposed architecture uses the Haar wavelet to perform the 1-D DWT in the temporal
direction (between two adjacent frames). Sweldens et al. [?] developed a lifting-based Haar wavelet. The
equations of the lifting scheme for the Haar wavelet transform are shown in eqn. (7).

$$H = X_1 - X_0, \qquad L = \tfrac{1}{2}\,(X_0 + X_1) \tag{8}$$

where L and H are the low- and high-frequency coefficients, respectively. Eqn. (8) is obtained by
substituting the predict value P(z) = 1 and the update value S(z) = 1/2 in eqn. (7), and it is used to
develop the temporal processor that applies the 1-D DWT in the temporal direction (3rd dimension).

Figure 1. Block diagram for 3-D DWT
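The Haar lifting step with predict value 1 and update value 1/2 can be sketched in a few lines of Python. This is a minimal illustration, assuming that x0 and x1 are co-located coefficients from two adjacent frames and that the high-frequency output is taken as x1 - x0; the function names are hypothetical.

```python
def haar_lift(x0, x1):
    """Forward Haar lifting on one pair of temporally adjacent samples."""
    h = x1 - x0          # predict step with P(z) = 1
    l = x0 + h / 2       # update step with S(z) = 1/2, equal to (x0 + x1) / 2
    return l, h


def haar_unlift(l, h):
    """Inverse lifting: undo the update step, then the predict step."""
    x0 = l - h / 2
    x1 = h + x0
    return x0, x1
```

Because each output depends only on one sample from each of the two frames, no earlier frame has to be buffered, which is consistent with the motivation given for choosing the Haar filter in the temporal direction.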

III. Proposed architecture for 3-D DWT 

The proposed architecture for 3-D DWT, comprising two parallel spatial processors (2-D DWT) and four
temporal processors (1-D DWT), is depicted in Fig. 1. After applying the 2-D DWT to two consecutive
frames, each spatial processor (SP) produces four sub-bands, viz. LL, HL, LH, and HH, which are fed to the
inputs of the four temporal processors (TPs) to perform the temporal transform. The output of these TPs is
a low-frequency frame (L-frame) and a high-frequency frame (H-frame). Architectural details of the spatial
processor and the temporal processors are discussed in the following sections.

A. Architecture for Spatial Processor 

In this section, we propose a new parallel and memory-efficient lifting-based 2-D DWT architecture,
denoted the spatial processor (SP), which consists of row and column processors. The proposed SP is a
revised version of the architecture developed by Y. Hu et al. [17]. The proposed architecture utilizes
strip-based scanning [17] to enable a trade-off between external memory and internal memory. To reduce the
critical path in each stage, the flipping model [5]-[6] is used to develop the processing element (PE).
Each PE has been developed with shift-and-add techniques in place of a multiplier. The lifting-based (9/7)
1-D DWT is performed by the processing unit (PU) in the proposed architecture. As shown in Fig. 2, the
proposed PU is designed with five PEs, and each PE except the first (the shift PE) is constructed with two
pipeline stages for further reduction of the CPD. This modified PU reduces the CPD to Ta (adder delay).
Fig. 3 shows that the number of inputs to the spatial processor is 2P + 1, which is also the width of the
strip, where P is the number of parallel processing units (PUs) in the row processor as well as in the
column processor. We have designed the proposed architecture with two parallel processing units (P = 2).
The same structure can be extended to P = 4, 8, 16, or 32 depending on the external bandwidth. As soon as
the row processor produces intermediate results, the column processor starts processing them. The row
processor takes 9 clocks to produce the intermediate results, the column processor then takes 9 more
clocks to give the 2-D DWT output, and finally the temporal processor takes 3 more clocks after the 2-D DWT
results are available to produce the 3-D DWT output. In summary, the proposed 2-D DWT and 3-D DWT
architectures have constant latencies of 18 and 21 clock cycles, respectively, regardless of the image
size N and the number of parallel PUs (P). Details of the row processor and the column processor are given
in the following sub-sections.

Figure 2. (a) Data flow graph of the processing unit (b) Processing unit with five pipeline stages (c) Processing unit with nine pipeline stages

1) Row Processor (RP): Let X be an image of size N x N, extended by one column using symmetric extension
so that its size becomes N x (N + 1). Refer to [17] for the structure of the strip-based scanning method.
The proposed architecture first performs the row-wise DWT through the row processor (RP) and then the
column-wise DWT through the column processor (CP). Fig. 3(a) shows the generalized structure of a row
processor with P PUs; P = 2 is used in our design. In the first clock cycle, the RP receives the pixels
X(0, 0) to X(0, 2P) simultaneously. In the second clock cycle it receives the pixels of the next row,
X(1, 0) to X(1, 2P), and the same procedure continues every clock until it reaches the bottom row,
X(N - 1, 0) to X(N - 1, 2P). It then moves to the next strip, where the RP receives the pixels X(0, 2P) to
X(0, 4P), and this procedure continues for the entire image. Each PU consists of five pipeline stages, and
each pipeline stage is processed by one processing element (PE), as depicted in Fig. 2(b). The first stage
(shift_PE) provides the partial results required at the 2nd stage (PE_alpha); likewise, the processing
elements PE_alpha to PE_delta (2nd to 5th stage) provide partial results along with their original
outputs. For example, PE_alpha of PU-1 must provide the output corresponding to eqn. (1), H1[n]; along with
H1[n], it also provides the partial output X'[2n], which is required by PE_beta. The structure of the PEs
is given in Fig. 2(b), which shows that multiplication is replaced with the shift-and-add technique. The
original multiplication factors and the values obtained through the shift-and-add circuit are listed in
Table I; the difference between the original and the adopted values is small. As shown in Fig. 2(b), the
delay of shift_PE is one adder delay (Ta) and all remaining PEs have a delay of 2Ta. To reduce the CPD of
the PU, the PEs from PE_alpha to PE_delta are each divided into two pipeline stages, each with a delay of
Ta; as a result, the CPD of the PU is reduced to Ta and the number of pipeline stages increases to nine,
as shown in Fig. 2(c). The outputs H1[n + P - 1], L1[n + P - 1], and H2[n + P - 1], corresponding to
PE_alpha, PE_beta, and PE_gamma of the last PU, are saved in the memories Memory_alpha, Memory_beta, and
Memory_gamma, respectively, as shown in Fig. 3(a). These stored outputs are used as inputs for the
subsequent columns of the same row. For an N x N image the number of rows is N, so the size of each memory
is N x 1 words and the total row memory needed to store these outputs is 3N words. The outputs of each PU
undergo scaling before the outputs H and L are produced. These outputs are fed to the transposing unit.
The transpose unit has P transpose registers (one for each PU). Fig. 4(a) shows the structure of the
transpose register, which delivers two H and two L data alternately to the column processor.

Figure 3. (a) Row processor (b) Column processor

Figure 4. (a) Transpose register (Ref: [17]) (b) Re-arrange unit
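The scan order described for the row processor can be sketched as a small Python generator. It is a minimal illustration of the description above, not the authors' control logic: the function name is hypothetical, rows are indexed from 0 to N - 1, and N is assumed to be a multiple of 2P.

```python
def strip_scan(n, p):
    """Yield (clock, row, first_col, last_col) for an N x (N + 1) extended image.

    Each clock delivers 2P + 1 pixels of one row. Consecutive strips share one
    column, so every strip contributes 2P new columns.
    """
    width = 2 * p
    if n % width != 0:
        raise ValueError("this sketch assumes N is a multiple of 2P")
    clock = 0
    for first_col in range(0, n, width):      # one pass per strip
        for row in range(n):                  # top row to bottom row
            yield clock, row, first_col, first_col + width
            clock += 1
```

For N = 8 and P = 2 the generator produces 16 steps, which agrees with the N^2/2P computation cycles quoted for an N x N image.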

2) Column Processor (CP): The structure of the column processor (CP) is shown in Fig. 3(b). To match the
throughput of the RP, the CP is also designed with two PUs in our architecture. Each transpose register
produces a pair of H and L values in alternating order, which are fed to the inputs of one PU of the CP.
The partial results produced are consumed by the next PE after two clock cycles. As such, shift registers
of length two are needed within the CP between pipeline stages for caching the partial results (except
between the 1st and 2nd pipeline stages). At the output of the CP, four sub-bands are generated in an
interleaved pattern, i.e., (HL, HH), (LL, LH), (HL, HH), (LL, LH), and so on. The outputs of the CP are
fed to the re-arrange unit. Fig. 4(b) shows the architecture of the re-arrange unit, which provides the
outputs in sub-band order, i.e., LL, LH, HL, and HH simultaneously, by using P registers and 2P
multiplexers. For multilevel decomposition, the same DWT core can be used in a folded architecture with an
external frame buffer for the LL sub-band coefficients.

Table I
Original and adopted values for multiplication

PE         Original multiplier value   Multiplier value through shift and add
PE_alpha   a' = -0.6305                a' = -0.6328
PE_beta    b' = 11.90                  b' = 12
PE_gamma   c' = -21.378                c' = -21.375
PE_delta   d' = 2.55                   d' = 2.5625
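The shift-and-add values listed in Table I can be checked numerically. The sketch below is only an illustration: the paper gives the resulting constants but not the individual shift terms, so the power-of-two decompositions in TERMS are assumptions chosen to reproduce those constants, and the helper names are hypothetical.

```python
FRAC_BITS = 7  # fractional bits kept so that the smallest term, 2**-7, is exact

# (sign, exponents): the constant is sign * sum(2**e). Assumed decompositions.
TERMS = {
    "a'": (-1, [-1, -3, -7]),        # -(1/2 + 1/8 + 1/128)       = -0.6328125
    "b'": (+1, [3, 2]),              #   8 + 4                     = 12
    "c'": (-1, [4, 2, 0, -2, -3]),   # -(16 + 4 + 1 + 1/4 + 1/8)   = -21.375
    "d'": (+1, [1, -1, -4]),         #   2 + 1/2 + 1/16            = 2.5625
}

# Original multiplier values as listed in Table I.
ORIGINAL = {"a'": -0.6305, "b'": 11.90, "c'": -21.378, "d'": 2.55}


def constant(sign, exponents):
    """Value represented by a set of shift terms."""
    return sign * sum(2.0 ** e for e in exponents)


def shift_add(x, sign, exponents):
    """Multiply integer x by the constant using only shifts and adds.

    The result is scaled by 2**FRAC_BITS.
    """
    acc = 0
    for e in exponents:
        acc += x << (FRAC_BITS + e)
    return sign * acc


def relative_error(name):
    """Relative deviation of the shift-and-add constant from the original."""
    return abs(constant(*TERMS[name]) - ORIGINAL[name]) / abs(ORIGINAL[name])
```

With these assumed decompositions every adopted constant stays within one percent of the original value, and the largest deviation is for b'.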

B. Architecture for Temporal Processor (TP) 

Eqn. (8) shows that the Haar wavelet transform depends on two adjacent pixel values (the same pixel
position in adjacent frames, for temporal processing). As soon as the spatial processors provide the 2-D
DWT results, the temporal processors start processing the spatial processor outputs (2-D DWT results) and
produce the 3-D DWT results. Fig. 1 shows that no temporal buffer is required, because the sub-band
coefficients of the two spatial processors are connected directly to the four temporal processors.
However, each TP is designed with 3 pipeline stages and therefore requires 6 pipeline registers. The same
frequency sub-band from each of the two spatial processors is fed to one temporal processor, i.e., the LL,
HL, LH, and HH sub-bands of spatial processors 1 and 2 are given as inputs to temporal processors 1, 2, 3,
and 4, respectively. Each temporal processor applies the 1-D Haar wavelet to the sub-band coefficients and
provides a low-frequency sub-band and a high-frequency sub-band as output. Combining the low-frequency and
high-frequency sub-bands of all temporal processors yields the 3-D DWT output in the form of an L-frame
and an H-frame (2-D DWT by the spatial processors and 1-D DWT by the temporal processors).

IV. Implementation Results and Performance Comparison 

The proposed 3-D DWT architecture has been described in Verilog HDL. A uniform word length of 14 bits has
been maintained throughout the design. Simulation results have been verified using the Xilinx ISE
simulator. We have simulated a Matlab model equivalent to the proposed 3-D DWT hardware architecture and
verified the 3-D DWT coefficients. The RTL simulation results were found to match the Matlab simulation
results exactly. The Verilog RTL code is synthesized using the Xilinx ISE 14.2 tool and mapped to a Xilinx
programmable device (FPGA) 7z020clg484 (Zynq board) with a speed grade of -3. Table II shows the device
utilization summary of the proposed architecture, which operates at a maximum frequency of 265 MHz.

Table II
Device utilisation summary of the proposed architecture

Logic utilized                      Used    Available   Utilization
Slice Registers                     1958    106400      1%
Number of Slice LUTs                2852    53200       5%
Number of fully used LUT-FF pairs   1137    3673        30%
Number of Block RAM                 3       140         2%

Table III
Comparison of proposed 2-D DWT architecture with existing architectures (for 1-level)

Parameter          Zhang [11]   Mohanty [13]     Darji [14]   Yusong [17]     Proposed
Multipliers        10           9P               10           10P             0
Adders             16           16P              16           16P             34P
Internal memory    4N + 37      15P + 5.5N       4N           24P + 3N        60P + 3N
Critical path      Tm           Tm + 2Ta         Tm           Tm + Ta         Ta
Computation time   N^2/2        N^2/2P           N^2/2        N^2/2P          N^2/2P
Throughput         2/Tm         2P/(Tm + 2Ta)    2/Tm         2P/(Tm + Ta)    2P/Ta

Table IV
Comparison of proposed 3-D DWT architecture with existing architectures (for 1-level)

Parameters                  Weeks [19]       Taghavi [20]   A. Das [22]      Darji [24]         Proposed
Memory requirement          6N^2 + 6l                       5N^2 + 5N        4N^2 + 10N         2*(3N + 60P) + 48
Throughput/cycle            -                1 result       2 results        4 results          8 results
Computing time (2 frames)   2N^2 + 3l/2      6N^2           3N^2             3N^2               N^2/2P
Latency                     2.5N^2 + 0.5l    4N^2 cycles    2N^2 cycles      3N^2/2 cycles      21 cycles
Area                        -                -              1825 slices      2490 slices        2852 slice LUTs
Operating frequency         200 MHz (ASIC)   -              321 MHz (FPGA)   91.87 MHz (FPGA)   265 MHz (FPGA)
Multipliers                 -                -              Nil              30                 Nil
Adders                      61 MACs          -              78               48                 168
Filter bank                 l-length         D-9/7          D-9/7            D-9/7              D-9/7 (2-D) + Haar (1-D)

Table V
Synthesis results (Design Vision) comparison of the proposed 3-D DWT architecture with the existing design

Parameters            Darji et al. [24]   Proposed
Comb. area            61351 µm^2          526419 µm^2
Non-comb. area        807223 µm^2         553078 µm^2
Total cell area       868574 µm^2         1079498 µm^2
Operating voltage     1.98 V              1.2 V
Total dynamic power   179.75 mW           38.56 mW
Cell leakage power    46.87 µW            4.86 mW

The proposed architecture has also been synthesized using Synopsys Design Compiler with a 90-nm CMOS
standard cell library. It consumes 43.42 mW and occupies an area of 231.45 K equivalent gates at a
frequency of 200 MHz.


A. Comparison 

The performance of the proposed 2-D and 3-D DWT architectures is compared with that of other existing
architectures in Tables III and IV, respectively. The proposed 2-D processor requires zero multipliers,
34P adders (P is the number of parallel PUs), and 60P + 3N words of internal memory. It has a critical
path delay of Ta and a throughput of four outputs per cycle (for P = 2), and it takes N^2/2P computation
cycles to process an image of size N x N. Compared with the recent 2-D DWT architecture developed by
Y. Hu et al. [17], the CPD is reduced from Tm + Ta to Ta at the cost of a small increase in hardware
resources.

Table IV shows the comparison of the proposed 3-D DWT architecture with existing 3-D DWT architectures.
The proposed design has a lower memory requirement, higher throughput, less computation time, and minimal
latency compared with [19], [20], [22], and [24]. Although the proposed 3-D DWT architecture has a small
disadvantage in area and frequency when compared with [22], it has a clear advantage in all remaining
aspects.
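To give a feel for the memory figures in Table IV, the short Python sketch below evaluates the on-chip memory expressions of the proposed design and of [24] in words. The frame size N = 512 is an illustrative assumption, not a value taken from the paper, and the function names are hypothetical.

```python
def memory_proposed(n, p):
    """On-chip memory of the proposed 3-D DWT, 2*(3N + 60P) + 48 words."""
    return 2 * (3 * n + 60 * p) + 48


def memory_darji(n):
    """On-chip memory reported for [24], 4N^2 + 10N words."""
    return 4 * n * n + 10 * n


def memory_ratio(n, p):
    """How many times larger the memory of [24] is than the proposed one."""
    return memory_darji(n) / memory_proposed(n, p)
```

For N = 512 and P = 2 the expressions give 3360 words for the proposed design and 1053696 words for [24]. The gap widens with frame size because the proposed expression grows linearly in N while the other grows quadratically.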

Table V compares the synthesis results of the proposed 3-D DWT architecture with those of [24]. The
proposed design appears to occupy more cell area, but its figure includes the total on-chip memory,
whereas the on-chip memory is not included in [24]. The power consumption of the proposed 3-D architecture
is much lower than that of [24].


V. Conclusions 

In this paper, we have proposed a memory-efficient and high-throughput architecture for lifting-based
3-D DWT. The proposed architecture is implemented on the 7z020clg484 FPGA target of the Zynq family and
also synthesized with Synopsys Design Vision for ASIC implementation. An efficient design of the 2-D
spatial processor and the 1-D temporal processor reduces the internal memory, latency, CPD, and complexity
of the control unit, and increases the throughput. Compared with existing architectures, the proposed
scheme shows higher performance at the cost of a slight increase in area. The proposed 3-D DWT
architecture is capable of computing 60 UHD (3840x2160) frames per second.
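The cycle budget behind the UHD frame-rate figure can be estimated with a short Python sketch. It is a rough check under stated assumptions: N^2 in the computation-time expression is replaced by the pixel count width x height, two frames are processed together in (width x height)/2P cycles, and pipeline latency, strip overlap, and external memory bandwidth are ignored. The function name is hypothetical.

```python
def frame_rate(width, height, p, f_clk_hz):
    """Return (cycles per frame pair, frames per second) for the assumed model."""
    # Each spatial processor consumes 2P new pixels per clock, and the two
    # spatial processors work on two adjacent frames in parallel.
    cycles_per_pair = (width * height) // (2 * p)
    frames_per_second = 2 * f_clk_hz / cycles_per_pair
    return cycles_per_pair, frames_per_second
```

With width = 3840, height = 2160, P = 2, and the 200 MHz ASIC clock, this model gives 2073600 cycles per frame pair, which leaves headroom above 60 frames per second under the idealized assumptions listed.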